What I'm talking about today is basically a follow-up to a talk I gave a few years ago
at DEF CON 18 about looking at information that's freely available out there on the net,
doing some trending and analysis of it, and trying to make something useful out of
it.
So a little bit about my background.
I'm currently the Director of Technology at the Center for Law Enforcement Technology
Training and Research, which is a nonprofit research center that got spun out of work
that I used to do when I was a professor at the University of Central Florida.
I was there for about ten years and taught computer engineering in the engineering program.
I developed the computer security curriculum there and did embedded systems, amongst some
other things.
Eventually I moved away from teaching and more into research, and we ended up spinning out
that research into an independent nonprofit center.
I'm also CTO for Hoverfly Technologies.
And prior to this, I used to work as a research associate up at the Institute for Security
Technology Studies at Dartmouth College.
So over the course of the last 20 years, some of the things that I've worked on are
up here on this list.
And, you know, it took me quite a while to catch on to the common theme
between all of the things I was working on, because I'm kind of slow to pick up on these
things at times.
And eventually, as I started putting it together and realizing I kept coming
across the same things and doing the same things, I realized that all of this stuff, from the
information sharing that I'm working on now to hardware sensor networks to intrusion detection
systems, really relies on some of the basic concepts of sensor data collection and, in
particular, sensor fusion.
Because everything that we're doing in all of those things that are listed up there is based
on taking some sort of sensor and using it to try to get some measure of reality.
But the sensor always has some limitations.
Sometimes it's a significant one, sometimes it's not so bad.
But every sensor that we use to look at reality, including ourselves when we view
things, has always got some sort of limitation, and it's one particular view, and that influences
the data we're seeing.
And we have to work towards trying to get more meaning out of the
data that we have.
One of the ways that we do this, and one of the techniques that I find most versatile,
I would say, is sensor fusion, where we take multiple sensors, multiple ways of
looking at the same thing, and put them together with the hope that we can take
the limitations of one observation and cancel them out with a different observation that has
a different set of limitations.
So at least that's the hope: that we can put two halfway decent things together and get
something that's more than the sum of its parts.
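As a rough illustration of that idea, one common textbook way to fuse two readings of the same quantity is inverse-variance weighting. The talk does not name a specific method, so this is only a sketch, and it assumes each sensor's error is independent, unbiased, and has a known variance. The function name `fuse` and the sample readings are made up for the example.

```python
def fuse(estimates):
    """Combine (value, variance) pairs by inverse-variance weighting.

    Assumes independent, unbiased sensor errors with known variances.
    """
    # A sensor with a smaller variance gets a proportionally larger weight.
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    value = sum(w * v for w, (v, _) in zip(weights, estimates)) / total
    # The fused variance is smaller than the variance of any single input.
    variance = 1.0 / total
    return value, variance


# Two hypothetical sensors measuring the same thing.
reading_a = (10.2, 4.0)  # noisier sensor, variance 4.0
reading_b = (9.6, 1.0)   # better sensor, variance 1.0

fused_value, fused_variance = fuse([reading_a, reading_b])
print(fused_value, fused_variance)
```

With these made-up numbers the fused estimate lands near 9.72, closer to the better sensor, and the fused variance of about 0.8 is lower than either input, which is the "more than the sum of its parts" effect in miniature.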
So before I get more into my stuff, I always feel like in this particular subject
I have to give an acknowledgment to the guy that inspired some of these thoughts
in my head.
And it was actually at DEF CON, way back at DEF CON 13, that Broward Horne gave this talk on
meme mining for fun and profit.
And his problem, you know, all great ideas come out of a problem. I mean, I guess
a lot of bad ideas come out of trying to solve a problem, too.
But his was a really good idea.
His problem was that he would start learning some new technology,
some new tool, or at least it was new to him, and by the time he felt he had mastered it,
it was kind of on the way out, or the job market was just saturated with people
doing that now, or it had just fallen by the wayside and nobody cared about it.
And he was always kind of struggling with trying to figure out: what should I spend my
time studying?
What should I learn to get ahead?
And he ended up thinking about this as, like, everything's got this sort of saturation
curve where a trend starts happening and there's a little bit of chatter about it and eventually
it starts taking off.
And everybody hears about it when it's big and growing, and then it kind of gets boring
and old.
But he wanted to try and identify these things earlier on.
And he went through and did it.
This is a slide pulled out of his old presentation, where what he would do is look at
news sources and forums and blogs for information and keywords and pull those out and
see what was trending on there,
with the idea that seeing that early chatter about something is kind of a precursor to it
taking off.
In this particular case, the red line shows how many times the word
palladium showed up in news reports and forums.
And the blue is the price of palladium.
And you can see that clearly there was a lot of chatter about it before the price spiked
up.
And then the chatter actually dropped off before the price came back down.
So it's, you know, apparently a really good indicator for predicting
what's going on there in the future.
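One hedged way to check whether chatter leads price in a chart like that is to shift one series against the other and see which shift gives the strongest Pearson correlation. This is a sketch of that general technique, not the method used in the original presentation, which the talk does not describe in detail. The function names and both series below are invented for illustration.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def best_lead(chatter, price, max_lag):
    """Return (lag, r): how many steps chatter leads price at the best match."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair chatter at time t with price at time t + lag.
        xs = chatter[: len(chatter) - lag]
        ys = price[lag:]
        scores[lag] = pearson(xs, ys)
    lag = max(scores, key=scores.get)
    return lag, scores[lag]


# Made-up weekly series: keyword mentions peak two steps before the price does.
mentions = [1, 2, 5, 9, 5, 2, 1, 1, 1, 1]
price = [10, 10, 11, 12, 15, 19, 15, 12, 11, 11]

lag, r = best_lead(mentions, price, max_lag=4)
print(lag, r)
```

A strong correlation at a positive lag only says the two curves line up when shifted. It is not evidence that the chatter causes or reliably predicts the price move.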
So anyway, that thought inspired me.
And when I was teaching, I'd have students who would come to me and
want to know what skills they need to get
a good job and all of that.
And I tried to apply what Broward had done in a similar way,
by monitoring and observing trends. This is mostly single-variable observation,
doing some correlation.
And it started off looking at Craigslist data,
just because Craigslist is nice.
It's nicely available.
It's well organized by geographic location.
And you can go into certain categories, like where they have the job postings.
It's categorized by different types of jobs.
And I know, you know, Craigslist isn't necessarily the best place to look for jobs.
But it had some interesting properties, in that it's a lot of small companies
that post on there, or companies maybe trying new things.
A lot of entrepreneurial companies, start-ups, things like that are posting there.
Not so much the big ones.
So that actually tends to skew it a little bit more towards being a leading indicator,
something that will come out a bit ahead of the curve.
So some of the things I ended up looking at, just because I found correlations in here,
were jobs, items for sale and adult services.
And I mean, I'm not saying I looked for adult services on Craigslist.
It's just that my research took me there.
So, you know, things I saw looked like this.
This is just an example.
This is just showing job postings by date. The dips you see there are a weekly trend. These
are some different cities. It goes kind of dead on the weekends. There's a spike on a Monday,
a spike on a Friday. You see this kind of pattern. And it's okay, fine, whatever. It's kind of
boring but sort of interesting, not unexpected. But there are certain things that started
standing out when you look at this data. In this particular case, one of the things that
jumped out at me was that Austin never had a spike on a Friday. It always dropped off. It's hard
to see, but it's the orange line in there. It never has a second spike in it. I thought
that was kind of interesting. The other thing, and this is what came out of the adult services,
was that there was a correlation between adult services being offered and bicycles being
for sale. Or actually a lot of items being for sale. And this led to a couple of interesting
discussions. One of my favorite moments at DEF CON was when somebody stood
up in the audience and said, hey, I think I can help you out. I'm from Austin and my
sister is a prostitute.
So that then led into a discussion of things you can sell one time, like a bicycle,
and something you can sell over and over and over again. So, okay. That's
what I had done before. And we had looked at that. And there's some interesting stuff
there. But I wanted to dig a bit deeper into the data and look for more relationships
and more correlations between data and hopefully be able to pull in other sources and do some
fusion on this. So I started looking for things like different cycles
in the job postings, or correlations in them. Because at the time when I was working
on this, keep in mind, I was really trying to help out some of the
students that were graduating and looking for jobs, trying to help them find out what
skills they needed, what would really help them get ahead. There were
definitely correlations in there. You know, there were things in the cycles you'd
see. But nothing unexpected. Nothing really interesting that jumped out in related skills.
You know, you could say that if a job was going to have
one particular tool set or skill set listed, there are other ones that are likely to be
listed with it as well. Again, nothing really jumped out at me as being unexpected
out of it. But eventually there were a couple of interesting things that showed up. One
that I think is just kind of funny was how often the words drug test or drug
screen showed up in a job advertisement, correlated with the different skills in it. And
apparently, if you don't think you're going to pass a drug test, don't bother learning SAP because
it's not going to do you any good. On the other hand, if you want to develop iOS applications,
go knock yourself out. I guess there's probably some logic here, like how corporate or uncorporate
the environment is, I suppose. Another thing was looking at jobs that had benefits,
like retirement and health and medical. You know, the interesting one, the best one, was
COBOL, but I think it was a bit of an outlier because there were just so few jobs offered
with COBOL, and I guess to get any old grizzled COBOL programmer to come work
for you, you've got to give them a lot of benefits. You know, with things like Python and
Android and HTML, looking for somebody to develop your web page, you're not going to
give them much in benefits, I suppose. So as I was looking into this, and
actually this is much more recent, this is earlier this year, I came across this article.
This is actually out of the Journal of Psychology, where a psychologist, Dorothy Gambrill, was
doing something similar and actually went through and looked at the missed connections
part of Craigslist. If you haven't ever been there, this is where people say, oh, I saw
you as I was walking across the parking lot and tried to catch your eye. And then they
go and post this up on the Internet hoping that person will find it and somehow make
a connection with them. And these are organized by state. These are where people had
the most missed connections. And there are some things that I just find funny. Like
Wal-Mart's got a lock on the south, you know. In Oklahoma, it's the state fair,
of course. You know, it makes perfect sense. And in Nevada, it's casinos, you
know. And I just had to put this up there: one thing that just jumped
out at me like crazy was Indiana. It's at home.
Like I don't know what they're doing in Indiana, but I'm pretty sure they're doing it wrong.
So I was talking with a friend of mine about this stuff, Dave Grubleski, and his eyes lit
up and he started telling me about this thing that he had done. This is back in Orlando,
Florida. His neighborhood had had a rash of crime
recently, and they didn't really know they had a rash of crime until the neighbors got
together and started talking with each other and found out about a whole bunch of little
different incidents that had happened. He went and did some searching and found out there
was some open source data that the sheriff's office and police department would post about
their dispatch calls. And he started writing this little tool to take that, do some geolocating
on it and tweet it out, and then you can subscribe to it and get tweets from this thing, like
really hyper-local things for your neighborhood about what's going on there.
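The "hyper-local" step of a tool like that could be sketched as a simple distance filter over geocoded dispatch records. The talk does not describe how the actual tool works, so everything here is an assumption: the record layout, the `nearby` helper, and the coordinates are all invented, and the geocoding and tweeting steps are left out entirely. The distance calculation is the standard haversine great-circle formula.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby(calls, home, radius_km):
    """Keep only the calls within radius_km of the home (lat, lon) point."""
    return [
        call for call in calls
        if haversine_km(home[0], home[1], call["lat"], call["lon"]) <= radius_km
    ]


# Hypothetical, already-geocoded dispatch records (not real data).
calls = [
    {"type": "ACCIDENT", "lat": 28.6024, "lon": -81.2001},
    {"type": "SUSPICIOUS VEHICLE", "lat": 28.5984, "lon": -81.2100},
    {"type": "ALARM", "lat": 28.4500, "lon": -81.4700},
]
home = (28.6000, -81.2050)

local = nearby(calls, home, radius_km=2.0)
for call in local:
    print(call["type"])
```

A fixed radius is the simplest possible notion of "neighborhood". A real tool might instead match on named subdivisions or polygon boundaries.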
And actually, one thing that's funny: I just pulled this up earlier today and, you
know, I was just noticing things.
This is in the Orlando area.
The first tweet that's on there, and I'm amazed that the sheriff's office is putting this
out, is basically saying there's a designated patrol area available, which means there's
an area where there's nobody patrolling currently.
And this is down in a real tourist trap part of Orlando, so, you know, I mean, that
could be useful information for somebody to know there are no cops there right now.
And then there are a few accidents, and then I guess the people at the bottom down on Poppy
Avenue would be happy to know there's a fugitive from justice running around in their area.
So this kind of led us to look into more sources for data, because what they offered
where we were wasn't very useful or organized.
And we started looking at places that subscribe more to the
open gov system, and this is a movement to have more transparent government data.
Some cities publish huge amounts of data about what's going on in their city with the
fire department, police department.
Live, interesting data.
Seattle, Boston, Chicago, and a number of others. These are three that we spent a bit of time
looking at.
There's information about incidents that are going on, like police and fire.
In Chicago, you can actually track where the snow plows are in the city.
You can track where garbage trucks are in real time, which I just find
really kind of fascinating.
There's information about where bicycle racks, public toilets, and landmarks are, and even where
the city has all of its cameras posted, which I thought was
actually particularly interesting, because you can really go on here and make a map of what
is an observable location throughout the city and what is not an observable location.
Which, again, could be useful information for somebody.
Here's something, the Seattle one is great.
They've got their visualization tools built right into this thing.
And this is a map showing police incidents over a period of time in part of Seattle.
I pulled up this area, and you'll notice that most of it is kind
of in that same yellow-orange, except for this one big glowing red blob out there,
you know, over in Georgetown.
I don't know if anybody here is from Seattle,
but I'm, like, wondering what the heck is going on over in Georgetown.
And you can look in a little bit closer, and right next to it is the Boeing propulsion
engineering labs, which, you know, makes me feel really good.
So coming back to, like, an area I know a bit more about, back in Orlando, we pulled
out data on traffic tickets.
They don't publish information about, like, who got the ticket or exactly what the ticket
was for.
But you can see when a traffic stop occurred.
And we looked at it and pulled data that covered three roads in the area.
This is right out by the University of Central Florida.
These are three roads that all run east-west, and they're kind of the three major roads:
one goes right into the university, one is a bit north, one is a
bit south.
And they all have about the same amount of traffic on them, and they all have a very similar
traffic pattern.
What this chart is showing here is that each one of the groupings
is a week-long period, all five weekdays.
And then it's repeated over six weeks.
And one of the things that I found really interesting was that the chance of a traffic ticket
occurring on one of these roads was highest at different
times of the day for each road, and the order
always followed the same sort of pattern, particularly between Highway 50 and University
Boulevard: the Highway 50 traffic stops always preceded the University Boulevard traffic
stops.
And when you go out there and you look at the traffic, the traffic pattern is not really
any different.
So you start thinking about this and putting it together: well, why do you always
see one before the other?
You know, I don't have hard evidence to back this up,
but our belief is that you're seeing an influence of the patrol pattern of the
police in the city.
So you're actually able to get in there and, through the information that they're
putting out, sort of start tracking them.
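One simple way to test an ordering claim like this is to compute a typical stop time per road per day and count how often one road comes first. The talk does not say how the ordering was actually measured, so this is only a sketch under assumptions: the `(day, road, hour)` record layout, the use of the median as the "typical" time, and the sample records are all made up.

```python
from collections import defaultdict
from statistics import median


def typical_hour(stops):
    """Map (day, road) to the median hour of day of the stops recorded there."""
    hours = defaultdict(list)
    for day, road, hour in stops:
        hours[(day, road)].append(hour)
    return {key: median(values) for key, values in hours.items()}


def days_first_precedes(stops, first, second):
    """Count days on which `first` has an earlier typical stop hour than `second`.

    Returns (days where first is earlier, days where both roads have data).
    """
    typical = typical_hour(stops)
    days = {day for day, _ in typical}
    both = [d for d in days if (d, first) in typical and (d, second) in typical]
    earlier = sum(typical[(d, first)] < typical[(d, second)] for d in both)
    return earlier, len(both)


# Made-up records: (day, road, hour of day as a decimal).
stops = [
    ("Mon", "Highway 50", 8.0),
    ("Mon", "Highway 50", 9.0),
    ("Mon", "University Boulevard", 10.5),
    ("Tue", "Highway 50", 7.5),
    ("Tue", "University Boulevard", 9.0),
    ("Tue", "University Boulevard", 11.0),
    ("Wed", "Highway 50", 13.0),
    ("Wed", "University Boulevard", 14.5),
]

earlier, total = days_first_precedes(stops, "Highway 50", "University Boulevard")
print(f"Highway 50 came first on {earlier} of {total} days")
```

If one road comes first on nearly every day over several weeks while the traffic itself looks the same, that is at least consistent with a fixed patrol route, though it does not prove one.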
It's kind of like, you know, a talk I went to yesterday, I guess it was,
a great talk by Brendan O'Connor about tracking people by
seeing, like, the information their devices are spitting out on wireless networks.
It's a similar concept: they're putting out a lot of information here, and
if you look at it the right way and you take the right pieces of data and put them together,
you can pull a lot more information out about what they're doing
and what's going on.
So by this time I've kind of changed what I was interested
in doing,
probably because I quit teaching and left the university, so I don't have students
anymore.
So I'm not that interested in helping people find jobs.
Now I find it kind of interesting to look at these
government entities and the police and other things that are going on,
also because I've worked with law enforcement a lot.
And it's kind of interesting to see how on one hand they're very protective of
their data, but at the same time they're putting out a lot of information, and I'm not sure
they quite realize how much they're putting out there.
Frankly, I think it's actually kind of a good thing.
I like being able to have more information and being able to look back on them.
And like I say, you know, why should the NSA have all the fun spying on people?
So what's next with this?
There's so much more I'd like to talk about, but with these 20-minute talks you have
to be kind of fast.
What I'm really interested in is expanding the model that we've been using
to analyze this data.
We kind of built things that are very purpose-driven. The first set of analysis we
did was very structured around seeking out the jobs, and then we kind of
got sidetracked by the crime and went off in that direction.
And I want to bring this back together
and try to build a more robust model for analyzing this data and throw some data mining at it.
So far a lot of what we've done has been what I'd say is hypothesis-based,
where I make a prediction about something I think I should see, some correlation, and
then go looking for it to try and see if it exists in the data or doesn't exist. And I'm
sure there are a lot of relations in there that are things that, you know,
I wouldn't expect or wouldn't find otherwise. I want to throw a bit of data mining,
that sort of blind AI or brute-force type of approach, at finding relations
throughout the data. So I think I'm about out of time right now, and I'm getting a nod
from the back, so I'll wrap it up there, and if there are any questions I'd be happy to
take a couple until they cut me off.
Thank you.
